Identifying an object within content

ABSTRACT

A method for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the method comprises, for each of one or more images of the sequence of images: using a first neural network to determine whether or not an object of a predetermined type is depicted within the image; and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks to identify the object determined as being depicted within the image.

FIELD OF THE INVENTION

The present invention relates to methods, systems and computer programs for identifying an object within content.

BACKGROUND OF THE INVENTION

It is often desirable to be able to identify particular objects or patterns or characteristics within content (such as images, video sequences and audio content). This can be carried out for activities such as facial recognition, logo detection, product placement, voice recognition, etc. Various systems currently exist to enable such identification.

It would, however, be desirable to provide improved object identification, in terms of speed and accuracy of the results.

SUMMARY OF THE INVENTION

According to a first aspect of the invention, there is provided a method for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the method comprises, for each of one or more images of the sequence of images: using a first neural network to determine whether or not an object of a predetermined type is depicted within the image; and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks to identify the object determined as being depicted within the image.

The first neural network may be a convolutional neural network or a deep convolutional neural network.

One or more of the second neural networks may be a convolutional neural network or a deep convolutional neural network.

In some embodiments, using a first neural network to determine whether or not an object of a predetermined type is depicted within the image comprises: generating a plurality of candidate images from the image; using the first neural network to determine, for each of the candidate images, an indication of whether or not an object of the predetermined type is depicted in said candidate image; and using the indications to determine whether or not an object of the predetermined type is depicted within the image. One or more of the candidate images may be generated from the image by performing one or more geometric transformations on an area of the image.

The predetermined type may, for example, be a logo, a face or a person.

In some embodiments, the method comprises associating metadata with the image based on the identified object.

According to a second aspect of the invention, there is provided a method of determining unauthorized use of a video sequence, the method comprising: obtaining a video sequence from a source; and using a method according to any embodiment of the first aspect, when the predetermined type is a logo, to identify whether or not a logo is depicted within one or more images of the video sequence. The logo may be one of a plurality of predetermined logos.

According to a third aspect of the invention, there is provided a method for identifying an object within an amount of content, the method comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.

The amount of content may be an image or an audio snippet.

The first neural network may be a convolutional neural network or a deep convolutional neural network.

One or more of the second neural networks may be a convolutional neural network or a deep convolutional neural network.

In some embodiments, using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content comprises: generating a plurality of content candidates from the amount of content; using the first neural network to determine, for each of the content candidates, an indication of whether or not an object of the predetermined type is depicted in said content candidate; and using the indications to determine whether or not an object of the predetermined type is depicted within the amount of content. One or more of the content candidates may be generated from the amount of content by performing one or more geometric transformations on a portion of the amount of content.

In some embodiments, the amount of content is an audio snippet and the predetermined type is one of: a voice; a word; a phrase.

In some embodiments, the method comprises associating metadata with the amount of content based on the identified object.

According to a fourth aspect of the invention, there is provided an apparatus arranged to carry out a method according to any embodiment of the first to third aspects of the invention.

In particular, there may be provided a system for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the system comprises: an input arranged to receive an image of the sequence of images; a first neural network arranged to determine whether or not an object of a predetermined type is depicted within the image; and an ensemble of second neural networks, the ensemble arranged to, in response to the first neural network determining that an object of the predetermined type is depicted within the image, identify the object determined as being depicted within the image.

The first neural network may be a convolutional neural network or a deep convolutional neural network.

One or more of the second neural networks may be a convolutional neural network or a deep convolutional neural network.

In some embodiments, the system comprises a candidate image generator arranged to generate a plurality of candidate images from the image, wherein the first neural network is arranged to determine whether or not an object of a predetermined type is depicted within the image by: determining, for each of the candidate images, an indication of whether or not an object of the predetermined type is depicted in said candidate image; and using the indications to determine whether or not an object of the predetermined type is depicted within the image. One or more of the candidate images may be generated from the image by performing one or more geometric transformations on an area of the image.

The predetermined type may, for example, be a logo, a face or a person.

In some embodiments, the system is arranged to associate metadata with the image based on the identified object.

There may be provided a system arranged to determine unauthorized use of a video sequence, the system comprising: an input for obtaining a video sequence from a source; and a system as set out above, arranged to identify whether or not a logo is depicted within one or more images of the video sequence. The logo may be one of a plurality of predetermined logos.

There may be provided a system for identifying an object within an amount of content, the system comprising: a first neural network arranged to determine whether or not an object of a predetermined type is depicted within the amount of content; and an ensemble of second neural networks, the ensemble arranged, in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, to identify the object determined as being depicted within the amount of content.

The amount of content may be an image or an audio snippet.

The first neural network may be a convolutional neural network or a deep convolutional neural network.

One or more of the second neural networks may be a convolutional neural network or a deep convolutional neural network.

In some embodiments, the system comprises a candidate generator arranged to generate a plurality of content candidates from the amount of content, wherein the first neural network is arranged to determine whether or not an object of a predetermined type is depicted within the amount of content by: determining, for each of the content candidates, an indication of whether or not an object of the predetermined type is depicted in said content candidate; and using the indications to determine whether or not an object of the predetermined type is depicted within the amount of content. One or more of the content candidates may be generated from the amount of content by performing one or more geometric transformations on a portion of the amount of content.

In some embodiments, the amount of content is an audio snippet and the predetermined type is one of: a voice; a word; a phrase.

In some embodiments, the system is arranged to associate metadata with the amount of content based on the identified object.

According to a fifth aspect of the invention, there is provided a computer program which, when executed by one or more processors, causes the one or more processors to carry out a method according to any embodiment of the first to third aspects of the invention. The computer program may be stored on a computer-readable medium.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an example of a computer system;

FIG. 2 schematically illustrates a system according to some embodiments of the invention;

FIG. 3 schematically illustrates example samples for training neural networks according to some embodiments of the invention;

FIG. 4 is a flowchart illustrating a method of using the system of FIG. 2 according to some embodiments of the invention;

FIG. 5 schematically illustrates generation of candidate images by a candidate image generator according to some embodiments of the invention;

FIG. 6 schematically illustrates an example deployment scenario for the system of FIG. 2 according to some embodiments of the invention; and

FIG. 7 is a flowchart illustrating an example method according to some embodiments of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

In the description that follows and in the figures, certain embodiments of the invention are described. However, it will be appreciated that the invention is not limited to the embodiments that are described and that some embodiments may not include all of the features that are described below. It will be evident that various modifications and changes may be made herein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

1—System Overview

FIG. 1 schematically illustrates an example of a computer system 100. The system 100 comprises a computer 102. The computer 102 comprises: a storage medium 104, a memory 106, a processor 108, an interface 110, a user output interface 112, a user input interface 114 and a network interface 116, which may be linked together over one or more communication buses 118.

The storage medium 104 may be any form of non-volatile data storage device such as one or more of a hard disk drive, a magnetic disc, a solid-state-storage device, an optical disc, a ROM, etc. The storage medium 104 may store an operating system for the processor 108 to execute in order for the computer 102 to function. The storage medium 104 may also store one or more computer programs (or software or instructions or code).

The memory 106 may be any random access memory (storage unit or volatile storage medium) suitable for storing data and/or computer programs (or software or instructions or code).

The processor 108 may be any data processing unit suitable for executing one or more computer programs (such as those stored on the storage medium 104 and/or in the memory 106), some of which may be computer programs according to embodiments of the invention or computer programs that, when executed by the processor 108, cause the processor 108 to carry out a method according to an embodiment of the invention and configure the system 100 to be a system according to an embodiment of the invention. The processor 108 may comprise a single data processing unit or multiple data processing units operating in parallel, separately or in cooperation with each other. The processor 108, in carrying out data processing operations for embodiments of the invention, may store data to and/or read data from the storage medium 104 and/or the memory 106.

The interface 110 may be any unit for providing an interface to a device 122 external to, or removable from, the computer 102. The device 122 may be a data storage device, for example, one or more of an optical disc, a magnetic disc, a solid-state-storage device, etc. The device 122 may have processing capabilities—for example, the device may be a smart card. The interface 110 may therefore access data from, or provide data to, or interface with, the device 122 in accordance with one or more commands that it receives from the processor 108.

The user input interface 114 is arranged to receive input from a user, or operator, of the system 100. The user may provide this input via one or more input devices of the system 100, such as a mouse (or other pointing device) 126 and/or a keyboard 124, that are connected to, or in communication with, the user input interface 114. However, it will be appreciated that the user may provide input to the computer 102 via one or more additional or alternative input devices (such as a touch screen). The computer 102 may store the input received from the input devices via the user input interface 114 in the memory 106 for the processor 108 to subsequently access and process, or may pass it straight to the processor 108, so that the processor 108 can respond to the user input accordingly.

The user output interface 112 is arranged to provide a graphical/visual and/or audio output to a user, or operator, of the system 100. As such, the processor 108 may be arranged to instruct the user output interface 112 to form an image/video signal representing a desired graphical output, and to provide this signal to a monitor (or screen or display unit) 120 of the system 100 that is connected to the user output interface 112. Additionally or alternatively, the processor 108 may be arranged to instruct the user output interface 112 to form an audio signal representing a desired audio output, and to provide this signal to one or more speakers 121 of the system 100 that are connected to the user output interface 112.

Finally, the network interface 116 provides functionality for the computer 102 to download data from and/or upload data to one or more data communication networks.

It will be appreciated that the architecture of the system 100 illustrated in FIG. 1 and described above is merely exemplary and that other computer systems 100 with different architectures (for example with fewer components than shown in FIG. 1 or with additional and/or alternative components than shown in FIG. 1) may be used in embodiments of the invention. As examples, the computer system 100 could comprise one or more of: a personal computer; a server computer; a mobile telephone; a tablet; a laptop; a television set; a set top box; a games console; other mobile devices or consumer electronics devices; etc.

FIG. 2 schematically illustrates a system 200 according to an embodiment of the invention. The system 200 may be used to detect and identify an object (or feature or pattern) depicted in (or represented in or present in) a video sequence. The system 200 is concerned with detecting and identifying objects of a predetermined type (i.e. objects that belong to a particular/specific class/group/category of objects). Therefore, the system 200 may be configured for a corresponding predetermined type of object, i.e. one embodiment of the system 200 may be configured for a first predetermined type of object, whilst a different embodiment of the system 200 may be configured for a second, different, predetermined type of object.

For example, the object may be a logo of a television broadcaster, which is often depicted in (or overlaid onto) broadcast television images (usually in one of the corners of the images). In this example, the predetermined type of object may be “broadcaster logo” in general, and the video sequence may be, for example, a television broadcast. The system 200 may then be arranged to detect whether an object of the predetermined type (i.e. a broadcaster logo) is depicted in the television broadcast and, if so, to then identify which particular object (i.e. which particular broadcaster logo) is depicted in the television broadcast. Other example scenarios, with different types of object, are possible, as shall be discussed in more detail later.

For ease of understanding, in the following, embodiments of the invention shall sometimes be described with reference to the predetermined type of object being “broadcaster logo”, as discussed above. However, it will be appreciated that embodiments of the invention are not restricted to this predetermined type of object.

The system 200 comprises an input 204, a first neural network 208, an ensemble 210 of second neural networks 212, and an optional candidate image generator 206. For ease of reference, the first neural network 208 shall be referred to as NN₁. As shown in FIG. 2, the ensemble 210 of second neural networks 212 comprises (or makes use of, or is a collection or group of) a plurality of second neural networks 212. The number of second neural networks 212 shall be referred to herein as M (for some integer M>1) and, for ease of reference, the second neural networks 212 shall be referred to respectively as NN_(2,k) (k=1, 2, . . . , M). The system 200 may be implemented, for example, using one or more computer systems 100 of FIG. 1.

The input 204 is arranged to receive images of a video sequence 202. The video sequence 202 comprises a sequence (or series) of images F_(k) (k=1, 2, 3, . . . ). Each image F_(k) (k≥1) may be, for example, a video frame or one of two video fields of a video frame, as are known in this field of technology. The images F_(k) (k≥1) may be at any resolution (such as at the resolution for any of the NTSC, PAL and high definition standards).

As shall be described in more detail below, the system 200 processes the video sequence 202 on an image-by-image basis, i.e. each image F_(k) (k=1, 2, 3, . . . ) of the video sequence 202 may be processed independently from the other images of the video sequence 202. Thus, in FIG. 2, and in the subsequent discussion, the image from the video sequence 202 currently being processed by the system 200 is the image F_(j) (for some integer j≥1), also referred to as the “current image”.

The input 204 may take many forms. For example:

-   The video sequence 202 may be part of a television broadcast, video-on-demand, pay TV, etc., in which case the input 204 may comprise, or may make use of, a television receiver for receiving a television signal (such as a terrestrial television broadcast, a digital video broadcast, a cable television signal, a satellite television signal, etc.).
-   The video sequence 202 may be video distributed over a network (such as the Internet), in which case the input 204 may comprise, or may make use of, one or more network connections (such as the network interface 116) for connecting to a network (such as the Internet) so that video can be acquired or obtained via that network.
-   The video sequence 202 may be stored on a medium local to the system 200 (such as the storage medium 104), with the input 204 being arranged to read images from the video sequence 202 stored on the medium.

The candidate image generator 206 is arranged to generate a plurality of candidate images C_(k) (k=1, 2, . . . , N) based on the current image F_(j) (for some integer N>1). One of the candidate images C_(k) may be the same as the current image F_(j). The subsequent processing for the current image F_(j) is then based on the candidate images C_(k) (k=1, 2, . . . , N). The operation of the candidate image generator 206 shall be described in more detail shortly with reference to FIG. 5.

However, as mentioned, the candidate image generator 206 is optional. Thus, in some embodiments that do not utilize the candidate image generator 206, a plurality of candidate images C_(k) is not generated—instead, the subsequent processing is based only on the current image F_(j). Thus, in the following, the number of candidate images may be viewed as 1 (i.e. N=1), with C₁ equal to F_(j), i.e. the current image F_(j) may be considered to be a candidate image C₁ (and, indeed, the only candidate image).

The first neural network NN₁ is responsible for determining (or detecting or identifying) whether or not an object of a predetermined type is depicted within the current image F_(j). The first neural network NN₁ carries out this processing based on the candidate image(s) C_(k) (k=1, 2, . . . , N). If an object of the predetermined type is determined as being depicted within the current image F_(j), then the ensemble 210 of second neural networks NN_(2,k) (k=1, 2, . . . , M) is responsible for identifying (or classifying or recognising) the object that has been determined as being depicted within the current image F_(j). Thus, in the example of the predetermined type of object being “broadcaster logo”, the first neural network NN₁ is responsible for determining whether or not a broadcaster logo is depicted in an image F_(j) from a television broadcast and, if so, the ensemble 210 of second neural networks NN_(2,k) (k=1, 2, . . . , M) is responsible for identifying which particular broadcaster logo is depicted within the current image F_(j).
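Purely by way of illustration, the following is a minimal Python sketch of this two-stage flow. All names (identify_object, detector, ensemble) are hypothetical and are not taken from the patent; the first neural network and the ensemble members are assumed to be pre-trained models exposed as simple callables, and the ensemble is assumed, for this sketch only, to combine its members' outputs by majority vote (one of the combination options discussed later).

```python
from typing import Callable, Optional, Sequence

def identify_object(
    image,                                        # the current image F_j
    detector: Callable[[object], bool],           # first neural network NN_1
    ensemble: Sequence[Callable[[object], int]],  # second neural networks NN_2,k
) -> Optional[int]:
    """Two-stage detection/identification for a single image.

    Returns the index of the identified object O_x, or None when the
    detector finds no object of the predetermined type.
    """
    # Stage 1: a single, cheap pass with NN_1 decides whether an object
    # of the predetermined type is depicted at all.
    if not detector(image):
        return None

    # Stage 2: only when stage 1 fires is the (more expensive) ensemble
    # consulted; here each member votes for one of the T known objects.
    votes = [member(image) for member in ensemble]
    return max(set(votes), key=votes.count)  # majority vote

# Usage (hypothetical models):
#   result = identify_object(frame, nn1, [nn2_1, nn2_2, nn2_3])
```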

The first neural network NN₁ and each of the second neural networks NN_(2,k) (k=1, 2, . . . , M) may be any kind of neural network. Preferably, each of NN₁ and NN_(2,k) (k=1, 2, . . . , M) is a convolutional neural network (CNN) or, more preferably, a deep CNN, because these types of neural networks have been shown to be particularly well-suited to image analysis tasks. CNNs are well-known (see, for example, https://en.wikipedia.org/wiki/Convolutional_neural_network, the entire disclosure of which is incorporated herein by reference) and they shall not, therefore, be described in more detail herein. Examples of CNN architectures that embodiments of the invention may use for NN₁ and NN_(2,k) (k=1, 2, . . . , M) include:

-   The AlexNet architecture (see A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097-1105 and http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, the entire disclosure of which is incorporated herein by reference). AlexNet consists of five convolutional layers and three fully connected dense layers.
-   The VGGNet architecture (see K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, CoRR, vol. abs/1409.1556, 2014 and http://arxiv.org/abs/1409.1556, the entire disclosure of which is incorporated herein by reference). VGGNet consists of 13 convolutional and 3 fully connected dense layers, with a regular structure. In VGGNet, the basic building block consists of two or three stacked convolutional layers of the same size, followed by a 2×2 MaxPooling layer. This building block is repeated five times, with the number of filters doubling from 64 up to 512 filters per channel in the last block.
-   The ResNet architecture (see K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition” CoRR, vol. abs/1512.03385, 2015 and http://arxiv.org/abs/1512.03385, the entire disclosure of which is incorporated herein by reference). ResNet has a homogeneous structure, which consists of stacked residual blocks. Each residual block consists of two stacked convolutional layers, with the input to the residual block, in addition to feeding the first convolutional layer, also added to the output of the residual block (a minimal sketch of such a residual block is given after this list).
-   It will be appreciated that modifications to the AlexNet, VGGNet and ResNet architectures are possible to arrive at other CNN architectures. It will also be appreciated that other CNN architectures are possible.
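The patent does not mandate any particular framework; the following is a minimal sketch of the residual block just described, using PyTorch as an assumed implementation framework.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions; the block input is added to the
    block output (the skip connection described above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.relu(self.conv1(x))  # first stacked convolutional layer
        y = self.conv2(y)          # second stacked convolutional layer
        return F.relu(y + x)       # input added to the block's output

# Usage: a 64-channel feature map passes through with its shape unchanged.
block = ResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))  # -> shape (1, 64, 32, 32)
```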

Each of the second neural networks NN_(2,k) (k=1, 2, . . . , M) is different from the other second neural networks NN_(2,b) (b=1, 2, . . . , M; b≠k), in that: (i) NN_(2,k) uses its own respective neural network architecture that is different from the architecture for NN_(2,b); and/or (ii) NN_(2,k) was trained/initialized using a different training set of samples than that used for NN_(2,b). Thus, the ensemble 210 may obtain results from each of the different second neural networks NN_(2,k) (k=1, 2, . . . , M) and use those results to provide a final output corresponding to the current image F_(j). Ensembles of neural networks, and ways to combine outputs from multiple neural networks to obtain a single output for a task, are well-known (see, for example, https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning), the entire disclosure of which is incorporated herein by reference) and they shall not, therefore, be described in more detail herein except where useful for further understanding of embodiments of the invention.

The architecture used for the first neural network NN₁ may be the same as the architecture used for one or more of the second neural networks NN_(2,k) (k=1, 2, . . . , M) or may be different from the architecture used for all of the second neural networks NN_(2,k) (k=1, 2, . . . , M). However, as discussed below, given the different tasks that the first neural network NN₁ and the ensemble 210 of second neural networks NN_(2,k) (k=1, 2, . . . , M) have to perform, they are trained using different respective training sets of samples.

2—Neural Network Training

As is well-known in the field of neural networks, a neural network needs to be trained for it to carry out a specific task, with the training based on samples.

Each of the second neural networks NN_(2,k) (k=1, 2, . . . , M) is, as discussed above, to be used to identify an object that has been determined as being depicted within the current image F_(j). It is, therefore, assumed that there is a set of particular objects of the predetermined type that the system 200 is to be used to try to identify. Let the number of such objects be represented by T, and this set of particular objects of the predetermined type be {O₁, O₂, . . . , O_(T)}. Indeed, specifying the set of objects {O₁, O₂, . . . , O_(T)} of interest may, in itself, define the predetermined type of object. For example, continuing the “broadcaster logo” example, there may be T broadcaster logos which the system 200 is intended to identify/discriminate, and object O_(k) (k=1, 2, . . . , T) is the k^(th) broadcaster logo.

A first set of samples S₁ may be generated, where each sample in S₁ is an image depicting one of the objects O_(k) (k=1, 2, . . . , T), and, for each object O_(k) (k=1, 2, . . . , T), S₁ comprises a plurality of images depicting that object O_(k). The set of samples S₁ therefore has T “classes” or “types” of sample (one for each object O_(k) (k=1, 2, . . . , T)). Each second neural network NN_(2,k) (k=1, 2, . . . , M) may be trained based on this set of samples S₁. The skilled person will appreciate that the number of images depicting each object O_(k) (k=1, 2, . . . , T) within the set of samples S₁ may be chosen to be sufficiently large so that the training of each second neural network NN_(2,k) (k=1, 2, . . . , M) is successful. For example, each of the T classes of samples for the set of samples S₁ may comprise around 5000 samples. Thus, each of the second neural networks NN_(2,k) (k=1, 2, . . . , M) is trained to distinguish between, or identify, the T specific objects as depicted in input images.
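For illustration only, the following is a minimal sketch of training one ensemble member NN_(2,k) as a T-class classifier on S₁, again assuming PyTorch; the function name, hyperparameters and the form of the data loader are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_identifier(model: nn.Module, samples_s1: DataLoader, epochs: int = 10):
    """Train one second neural network NN_2,k as a T-class classifier on S1.

    `samples_s1` is assumed to yield (image_tensor, class_index) pairs,
    where class_index in [0, T) names the depicted object O_x.
    """
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in samples_s1:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)  # T-way classification loss
            loss.backward()
            optimizer.step()
```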

In some embodiments, each image in the set of samples S₁ may be generated by: (1) obtaining an image depicting one of the objects O_(k) (k=1, 2, . . . , T); (2) identifying where in the image the object O_(k) is depicted; and (3) generating the sample by cropping a fixed size part of the image around the object O_(k). Such samples shall be called “padded samples”. This results in samples having the same dimensions, regardless of which object O_(k) is depicted therein. Step (2) may be carried out manually, or may be automated (e.g. in the “broadcaster logo” example, the location of the logo object O_(k) for a particular broadcaster may be known to be a predetermined position within an image).

In some embodiments, each image in the set of samples S₁ may be generated by: (1) obtaining an image depicting one of the objects O_(k) (k=1, 2, . . . , T); (2) identifying where in the image the object O_(k) is depicted; and (3) generating the sample by cropping a fixed size area/border around/from the object O_(k). Such samples shall be called “non-padded samples”. This results in samples having the same aspect ratio as the object O_(k) depicted therein, although different samples may then have different aspect ratios. Step (2) may be carried out manually, or may be automated (e.g. in the “broadcaster logo” example, the location of the logo object O_(k) for a particular broadcaster may be known to be a predetermined position within an image).

In some embodiments, both padded and non-padded samples may be used in the set of samples S₁. It will also be appreciated that the set of samples S₁ may comprise samples generated by other means (e.g. by simply using the original images that depict the object O_(k)) in addition to, or as alternatives to, the padded and/or non-padded samples.
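As one possible reading of the two cropping schemes above, the following sketch uses Pillow; it assumes the object's bounding box from step (2) is already known, and the crop sizes and border width are illustrative values only.

```python
from PIL import Image

def padded_sample(img: Image.Image, box, out_w=224, out_h=224):
    """Crop a fixed-size region centred on the object's bounding box.

    `box` = (left, top, right, bottom) locates the object (step (2));
    the crop size is the same for every sample, whatever the object.
    """
    cx, cy = (box[0] + box[2]) // 2, (box[1] + box[3]) // 2
    return img.crop((cx - out_w // 2, cy - out_h // 2,
                     cx + out_w // 2, cy + out_h // 2))

def non_padded_sample(img: Image.Image, box, border=8):
    """Crop a fixed-size border around the object itself, so the sample
    keeps the object's own aspect ratio."""
    left, top, right, bottom = box
    return img.crop((left - border, top - border,
                     right + border, bottom + border))

# Usage (hypothetical file and bounding box):
#   img = Image.open("frame.png")
#   s_padded = padded_sample(img, (10, 10, 120, 50))
#   s_nonpadded = non_padded_sample(img, (10, 10, 120, 50))
```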

FIG. 3 schematically illustrates the above-mentioned samples. Two original images 300a, 300b are illustrated, each depicting a corresponding object 302a, 302b of the predetermined type. Padded samples 304a, 304b may be generated which, as can be seen, have the same aspect ratio and the same overall size regardless of the object depicted therein. Non-padded samples 306a, 306b may be generated which, as can be seen, have different aspect ratios (due to the different dimensions of the objects 302a, 302b), but have a same-sized boundary around the objects 302a, 302b.

In some embodiments, additional samples in the set of samples S₁ may be generated by, for example: (i) applying one or more geometric transformations (such as a rotation, shear or scaling (zoom-in or zoom-out)) to the original images 300a, 300b and generating padded samples and/or non-padded samples from the transformed images, so that samples depicting the objects O_(k) (k=1, 2, . . . , T) in different transformed configurations are obtained; and/or (ii) adjusting where, within the sample, the object O_(k) (k=1, 2, . . . , T) is located (e.g. instead of being centred within the sample, the object O_(k) could be offset from the centre of the sample).
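One way (of many) to realise these augmentations is sketched below using torchvision transforms; the parameter values are illustrative assumptions, not values taken from the patent.

```python
from torchvision import transforms

# Random rotation, shear and scaling, plus a small random translation
# that offsets the object from the centre of the sample.
augment = transforms.Compose([
    transforms.RandomAffine(
        degrees=10,            # rotation
        shear=5,               # shear
        scale=(0.8, 1.2),      # zoom-in / zoom-out
        translate=(0.1, 0.1),  # offset the object from the sample centre
    ),
    transforms.ToTensor(),
])

# Usage: `augmented = augment(sample_image)` for a PIL image sample,
# applied repeatedly to grow the training set S1.
```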

It will be appreciated that each second neural network NN_(2,k) (k=1, 2, . . . , M) may be trained using its own respective set of samples S₁ as opposed to them all being trained using the same set of samples S₁.

The first neural network NN₁ is, as discussed above, to be used to determine whether or not an object of the predetermined type is depicted within the current image F_(j) (without consideration of which particular object O_(k) (k=1, 2, . . . , T) is depicted). A second set of samples S₂ may be generated, where the second set of samples S₂ comprises images without an object of the predetermined type depicted therein, and images with an object of the predetermined type depicted therein. The set of samples S₂ therefore has 2 “classes” or “types” of sample (a first class of samples that do not depict an object of the predetermined type, and a second class of samples that do depict an object of the predetermined type). Preferably, the second set of samples S₂ comprises, for each of the objects O_(k) (k=1, 2, . . . , T), a plurality of images that depict that object O_(k). The skilled person will appreciate that the number of images depicting an object of the predetermined type and the number of images not depicting an object of the predetermined type within the second set of samples S₂ may be chosen to be sufficiently large so that the training of the first neural network NN₁ is successful. For example, each of the 2 classes of samples for the set of samples S₂ may comprise around 5000T samples. Indeed, the second class for the second set of samples S₂ may comprise the first set of samples S₁—the first class for the second set of samples S₂ may then comprise a substantially similar number of samples as the second class for the second set of samples S₂. Thus, the first neural network NN₁ is trained to distinguish between, or identify, two types of image, namely images depicting an object of the predetermined type and images not depicting an object of the predetermined type.

For the second set of samples S₂, the samples that depict an object of the predetermined type (i.e. samples for the second class) may be obtained in a similar manner to the samples for the first set of samples S₁, for example by generating padded and/or non-padded samples as shown in FIG. 3.

It will be appreciated that the first and/or second sets of samples S₁ and S₂ may be generated in different ways, and that the first neural network NN₁ and/or the second neural networks NN_(2,k) (k=1, 2, . . . , M) may be trained in different ways so as to still be able to carry out their respective tasks.

3—Object Detection and Identification

FIG. 4 is a flowchart illustrating a method 400 of using the system 200 of FIG. 2 according to some embodiments of the invention. The method 400 is a method for identifying an object within the video sequence 202. The method 400 may be carried out by a computer system 100 as described above with reference to FIG. 1. The method 400 assumes that the first neural network NN₁ and the second neural networks NN_(2,k) (k=1, 2, . . . , M) have been trained so as to be able to carry out their respective tasks, as discussed above.

At a step 402, the system 200 uses the input 204 to obtain the current image F_(j) from the video sequence 202. The input 204 may actively obtain the image F_(j) (e.g. extract an image from a broadcast television signal) or may be provided with an image F_(j) (e.g. the system 200 may be instructed to test specific images F_(j) that are provided to the system 200).

If the system 200 makes use of the candidate image generator 206 then the method 400 comprises a step 404 at which the candidate image generator 206 generates the plurality of candidate images C_(k) (k=1, 2, . . . , N). If the system 200 does not make use of the candidate image generator 206 then the method 400 does not comprise the step 404 and, instead, there is only one candidate image C₁ (i.e. N=1), which is the current image (i.e. C₁=F_(j)).

At a step 406, the first neural network NN₁ is used to determine whether or not an object of the predetermined type is depicted within the candidate image(s) C_(k) (k=1, 2, . . . , N). The step 406 comprises providing each of the candidate image(s) C_(k) (k=1, 2, . . . , N) as an input to the first neural network NN₁ and using the first neural network NN₁ to determine whether or not an object of the predetermined type is present in that candidate image C_(k). Thus, for each candidate image C_(k) (k=1, 2, . . . , N), the first neural network NN₁ provides an indication of whether or not an object of the predetermined type is depicted in that candidate image C_(k). Continuing the “broadcaster logo” example, at the step 406, the first neural network NN₁ is used to test each of the candidate image(s) C_(k) (k=1, 2, . . . , N) to check whether or not a broadcaster logo is present in that candidate image C_(k).

Thus, the first neural network NN₁ produces a result R_(k) for each candidate image C_(k) (k=1, 2, . . . , N). The result R_(k) may take many forms. For example:

-   In some embodiments, R_(k) assumes one of two values: either a first value V₁ (e.g. TRUE) to indicate that the candidate image C_(k) depicts an object of the predetermined type, or a second value V₂ (e.g. FALSE) to indicate that the candidate image C_(k) does not depict an object of the predetermined type. An object of the predetermined type may therefore be determined to be depicted in the current image F_(j) if R_(k)=V₁ for at least a threshold number β₁ of the candidate images C_(k) (k=1, 2, . . . , N). In some embodiments, the threshold number β₁ may be 1, so that detection by the first neural network NN₁ in a single candidate image C_(k) (k=1, 2, . . . , N) is sufficient to conclude that the current image F_(j) depicts an object of the predetermined type. In some embodiments, the threshold number β₁ may be greater than 1.
-   In some embodiments, R_(k) is a confidence value indicating a likelihood that the candidate image C_(k) depicts an object of the predetermined type. For example, the confidence value may be in the range from 0 to 1. In the following, it is assumed that higher confidence values are indications of a higher likelihood that the candidate image C_(k) depicts an object of the predetermined type—however, it will be appreciated that the opposite may be true, and that embodiments of the invention may be adapted accordingly. An object of the predetermined type may therefore be determined to be depicted in the current image F_(j) if R_(k) is greater than a predetermined threshold β₂ for at least a threshold number β₁ of the candidate images C_(k) (k=1, 2, . . . , N). In some embodiments, the threshold number β₁ may be 1, so that detection by the first neural network NN₁ in a single candidate image C_(k) (k=1, 2, . . . , N) is sufficient to conclude that the current image F_(j) depicts an object of the predetermined type. In some embodiments, the threshold number β₁ may be greater than 1. Alternatively, an object of the predetermined type may be determined to be depicted in the current image F_(j) if a combination of the R_(k) (k=1, 2, . . . , N) is greater than a predetermined threshold β₃, e.g. if a product Π_(k=1)^(N) R_(k) or a linear combination Σ_(k=1)^(N) θ_(k)R_(k) (for some positive coefficients θ_(k)) of the R_(k) values exceeds β₃ (a sketch of these decision rules follows this list).
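Purely as an illustration, the following is a minimal sketch of the detection decision rules just described, assuming the R_(k) values are already available; the default threshold values are hypothetical, not taken from the patent.

```python
import math
from typing import Sequence

def detected_boolean(r: Sequence[bool], beta1: int = 1) -> bool:
    """Boolean results: detected if at least beta1 candidates have R_k = V1."""
    return sum(r) >= beta1

def detected_confidence(r: Sequence[float], beta1: int = 1,
                        beta2: float = 0.5) -> bool:
    """Confidence results: detected if at least beta1 candidates exceed beta2."""
    return sum(1 for v in r if v > beta2) >= beta1

def detected_combined(r: Sequence[float], theta: Sequence[float],
                      beta3: float = 0.5, use_product: bool = False) -> bool:
    """Combined results: detected if a product of the R_k values, or a
    linear combination with positive coefficients theta_k, exceeds beta3."""
    score = math.prod(r) if use_product else sum(t * v for t, v in zip(theta, r))
    return score > beta3
```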

Thus, at a step 408, the system 200 uses the results of the step 406 to determine whether an object of the predetermined type is depicted in the current image F_(j). This may be carried out in the manner set out above. However, it will be appreciated that the result R_(k) corresponding to the candidate image C_(k) (k=1, 2, . . . , N) may take other forms, and that other methods could be used for using the results R_(k) (k=1, 2, . . . , N) to determine whether or not an object of the predetermined type is depicted in the current image F_(j). Thus, together, the steps 406 and 408 involve using the first neural network NN₁ to determine whether or not an object of the predetermined type is depicted within the current image F_(j).

If the system 200 determines that an object of the predetermined type is not depicted in the current image F_(j), then processing continues at an optional step 416 at which the system 200 may carry out processing specific to the situation in which no object of the predetermined type is detected in the current image F_(j). For example, the system 200 may expect to always detect an object of the predetermined type, so that failure to detect such an object may be viewed as an error or an anomaly which needs to be logged or flagged for further investigation. Processing then continues at a step 418, as discussed later.

If the system 200 determines that an object of the predetermined type is depicted in the current image F_(j), then processing continues at a step 410. The step 410 is reached when there are one or more candidate images C_(k) (k=1, 2, . . . , N) in which the first neural network NN₁ had determined that an object of the predetermined type is depicted. Let there be L such candidate images C_(k_1), C_(k_2), . . . , C_(k_L) in which the first neural network NN₁ had determined that an object of the predetermined type is depicted. Let these candidate images C_(k_b) (b=1, 2, . . . , L) be called “positive candidate images”. For example, in the above-mentioned embodiments in which R_(k) assumes one of two values (either a first value V₁ (e.g. TRUE) to indicate that the candidate image C_(k) depicts an object of the predetermined type or a second value V₂ (e.g. FALSE) to indicate that the candidate image C_(k) does not depict an object of the predetermined type), the positive candidate images C_(k_b) (b=1, 2, . . . , L) are those candidate images C_(k) for which R_(k)=V₁. Likewise, in the above-mentioned embodiments in which R_(k) is a confidence value indicating a likelihood that the candidate image C_(k) depicts an object of the predetermined type, the positive candidate images C_(k_b) (b=1, 2, . . . , L) are those candidate images C_(k) for which R_(k) is greater than β₂.

At the step 410, each of the second neural networks NN_(2,k) (k=1, 2, . . . , M) is used to identify which object of the predetermined type is depicted within each of the positive candidate images C_(k_b) (b=1, 2, . . . , L). The step 410 comprises, for each second neural network NN_(2,k) (k=1, 2, . . . , M), providing each of the positive candidate image(s) C_(k_b) (b=1, 2, . . . , L) as an input to that second neural network NN_(2,k) and using that second neural network NN_(2,k) to generate a corresponding result S_(k,b). The result S_(k,b) produced by the second neural network NN_(2,k) for positive candidate image C_(k_b) may take many forms. For example:

-   S_(k,b) may be an indication of one object from the set of objects {O₁, O₂, . . . , O_(T)} that the second neural network NN_(2,k) determines to be the most likely object depicted in the positive candidate image C_(k_b). The results S_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L) may then be combined by identifying an object most frequently indicated by the set of results {S_(k,b): k=1, 2, . . . , M; b=1, 2, . . . , L}—this identified object may then be considered to be the object depicted in the current image F_(j).
-   S_(k,b) may comprise an indication of one object O_(k,b) from the set of objects {O₁, O₂, . . . , O_(T)} that the second neural network NN_(2,k) determines to be the most likely object depicted in the positive candidate image C_(k_b), together with an associated confidence value γ_(k,b) (e.g. a number in the range 0 to 1) indicating a degree of confidence that it is that object O_(k,b) that is depicted in the positive candidate image C_(k_b). In the following, it is assumed that higher confidence values are indications of a higher likelihood that it is that object O_(k,b) that is depicted in the positive candidate image C_(k_b)—however, it will be appreciated that the opposite may be true, and that embodiments of the invention may be adapted accordingly. The results S_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L) may then be combined in a number of ways. For example, the object O_(k,b) with the highest confidence value γ_(k,b) may be considered to be the object depicted in the current image F_(j). Alternatively, for each object O_(x) (x=1, 2, . . . , T), a corresponding confidence value γ_(x) for that object can be determined as the sum of the confidence values γ_(k,b) for which O_(x)=O_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L)—then the object O_(x) with the highest confidence value γ_(x) may be considered to be the object depicted in the current image F_(j).
-   S_(k,b) may comprise, for each object O_(x) (x=1, 2, . . . , T), an associated confidence value γ_(k,b,x) (e.g. a number in the range 0 to 1) indicating a degree of confidence that it is that object O_(x) that is depicted in the positive candidate image C_(k_b). In the following, it is assumed that higher confidence values are indications of a higher likelihood that it is that object O_(x) that is depicted in the positive candidate image C_(k_b)—however, it will be appreciated that the opposite may be true, and that embodiments of the invention may be adapted accordingly. The results S_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L) may then be combined in a number of ways. For example, for each object O_(x) (x=1, 2, . . . , T), an overall confidence value γ_(x) for that object can be determined as a product (Π_(k=1)^(M) Π_(b=1)^(L) γ_(k,b,x)) or a linear combination (Σ_(k=1)^(M) Σ_(b=1)^(L) θ_(k,b)γ_(k,b,x) for some positive coefficients θ_(k,b)) of the confidence values γ_(k,b,x) (k=1, 2, . . . , M; b=1, 2, . . . , L). Then the object O_(x) with the highest confidence value γ_(x) may be considered to be the object depicted in the current image F_(j) (a sketch of these combination rules follows this list).
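For illustration only, the following is a minimal sketch of the three combination rules just described, assuming the per-network, per-candidate results have already been collected into flat Python sequences; the function names and data layouts are assumptions rather than details from the patent.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def combine_majority(results: Sequence[int]) -> int:
    """First variant: each S_(k,b) names one object; pick the object
    indicated most frequently across all networks and candidates."""
    return Counter(results).most_common(1)[0][0]

def combine_summed_confidence(results: Sequence[Tuple[int, float]]) -> int:
    """Second variant: each S_(k,b) is (object, confidence gamma_(k,b));
    sum the confidences per object and pick the highest total."""
    totals: Dict[int, float] = {}
    for obj, gamma in results:
        totals[obj] = totals.get(obj, 0.0) + gamma
    return max(totals, key=totals.get)

def combine_linear(per_class: Sequence[Sequence[float]],
                   theta: Sequence[float]) -> int:
    """Third variant: each S_(k,b) holds one confidence gamma_(k,b,x) per
    object; form gamma_x = sum of theta_(k,b) * gamma_(k,b,x) over all
    (k, b) pairs and pick the object with the largest gamma_x."""
    num_objects = len(per_class[0])
    gammas: List[float] = [
        sum(t * conf[x] for t, conf in zip(theta, per_class))
        for x in range(num_objects)
    ]
    return max(range(num_objects), key=lambda x: gammas[x])
```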

Thus, at a step 412, the ensemble 210 combines the results S_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L) from the second neural networks to identify an object of the predetermined type in the current image F_(j). This may be carried out in the manner set out above. However, it will be appreciated that the results S_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L) may take other forms, and that other methods could be used for using the results S_(k,b) (k=1, 2, . . . , M; b=1, 2, . . . , L) to identify which object of the predetermined type is depicted in the current image F_(j).

Thus, together, the steps 410 and 412 involve, in response to the first neural network NN₁ determining that an object of the predetermined type is depicted within the current image F_(j), using the ensemble 210 of second neural networks NN_(2,k) (k=1, 2, . . . , M) to identify the object determined as being depicted within the current image F_(j).

It will be appreciated that the step 410 may involve the second neural networks NN_(2,k) (k=1, 2, . . . , M) using candidate images C_(k) other than, or in addition to, the positive candidate images. For example, the second neural networks NN_(2,k) (k=1, 2, . . . , M) may analyze all of the candidate images C_(k) (k=1, 2, . . . , N).

In some embodiments, at the step 412, the ensemble 210 may not be able to identify which object of the predetermined type is depicted in the current image F_(j). For example, in the above-mentioned embodiment in which the object O_(k,b) with the highest confidence value γ_(k,b) is considered to be the object depicted in the current image F_(j), such an embodiment may make use of a predetermined threshold β₄ such that if that highest confidence value γ_(k,b) exceeds β₄, then the object O_(k,b) is identified, whereas if that highest confidence value γ_(k,b) does not exceed β₄ then the ensemble 210 does not identify any object as being depicted within the current image F_(j) (so that object identification has not been successful). Likewise, in the above-mentioned embodiment in which the object O_(x) with the highest confidence value γ_(x) is considered to be the object depicted in the current image F_(j), such an embodiment may make use of a predetermined threshold β₄ such that if that highest confidence value γ_(x) exceeds β₄, then the object O_(x) is identified, whereas if that highest confidence value γ_(x) does not exceed β₄ then the ensemble 210 does not identify any object as being depicted within the current image F_(j) (so that object identification has not been successful). It will be appreciated that other mechanisms for the ensemble 210 to determine whether the object identification has been successful could be used.

Thus, in some embodiments, at an optional step 413, the ensemble 210 determines whether an object has been successfully identified. If the ensemble 210 determines that an object has been successfully identified, processing may continue at an optional step 414 (or, in the absence of such a step, at the step 418); otherwise, processing may continue at the optional step 416 (or, in the absence of such a step, at the step 418). In the absence of the step 413, processing may continue at the optional step 414 (or, in the absence of such a step, at the step 418).

At the optional step 414, the system 200 may carry out processing specific to the situation in which an object of the predetermined type is detected and identified in the current image F_(j). For example, in the “broadcaster logo” scenario, the video sequence may be an unauthorized copy of a broadcaster's content and, if a broadcaster's logo is detected and identified, then measures may be taken in relation to that unauthorized copy (e.g. alerting the broadcaster associated with that logo). Processing then continues at the step 418.

At the step 418, one or more further actions may be taken. For example, a log of the results of the method 400 may be updated (e.g. to store data indicating whether, for the frame F_(j), an object of the predetermined type was detected and, if so, which object was identified). Likewise, at the step 418, processing may return to the step 402 at which a next image from the video sequence 202 may be processed—this next image may be the image F_(j+1) (i.e. the immediate successor of the current image F_(j)) or some other image of the video sequence 202.

One advantage of using the particular structure for the system 200 illustrated in FIG. 2 when carrying out the method 400 of FIG. 4 is that: (a) a single neural network NN₁ is used to detect whether or not an object of the predetermined type is present, which does not consume as many processing resources as using an ensemble of neural networks; but (b) once an object of the predetermined type has been detected within the current image F_(j) (which may be less often than every image from the video sequence 202), more processing resources can be applied to the task of identifying that object, via the ensemble 210 of neural networks NN_(2,k) (k=1, 2, . . . , M), with use of the ensemble 210 providing for a greater degree of accuracy. Use of the ensemble 210 also helps prevent the system 200 from becoming over-fitted to the set of samples S₁. Together, this helps enable the system 200 to perform object detection and identification/recognition for video sequences on a frame-by-frame basis, rather than having to combine, and wait for, results compiled across multiple video frames.

FIG. 5 schematically illustrates the generation of the candidate images C_(k) (k=1, 2, . . . , N) by the candidate image generator 206 at the optional step 404. The current image F_(j) is shown as an image 500. Let the height of the image 500 be H and the width of the image 500 be W.

This image 500 may form one of the candidate images C_(k).

Four candidate images 502 can be generated by dividing the original image 500 into four non-overlapping tiles of size H/2×W/2, and then resizing these tiles to the original dimensions H×W of the original image 500.

Nine candidate images 504 can be generated by dividing the original image 500 into nine non-overlapping tiles of size H/3×W/3, and then resizing these tiles to the original dimensions H×W of the original image 500.

It will be appreciated that this process can be used to form a hierarchy of candidate images C_(k) at different levels. For a positive integer z, the z^(th) level may be formed by dividing the original image 500 into z² non-overlapping tiles of size H/z×W/z, and then resizing these tiles to the original dimensions H×W of the original image 500 to form corresponding candidate images. Thus, the 1^(st) level comprises the original image 500, the 2^(nd) level comprises the images 502, the 3^(rd) level comprises the images 504, etc. The set of candidate images C_(k) (k=1, 2, . . . , N) may comprise images from one or more levels. In some embodiments, all levels from level 1 to Z are used for some positive integer Z, with all of the images at each of those levels being used as candidate images. However, this is not essential—some embodiments may make use of non-consecutive levels and/or some embodiments do not necessarily use all of the images from a given level as candidate images.
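As an illustration only, the following is a minimal sketch of this tiling hierarchy using Pillow; the choice of library and of bilinear resampling are assumptions, not requirements of the patent.

```python
from typing import List
from PIL import Image

def candidate_images(original: Image.Image, max_level: int) -> List[Image.Image]:
    """Build the hierarchy of candidate images for levels 1..max_level.

    Level z splits the original into z*z non-overlapping tiles of size
    (W/z) x (H/z), each resized back to the original W x H dimensions.
    Level 1 is the original image itself.
    """
    w, h = original.size
    candidates = []
    for z in range(1, max_level + 1):
        tile_w, tile_h = w // z, h // z
        for row in range(z):
            for col in range(z):
                box = (col * tile_w, row * tile_h,
                       (col + 1) * tile_w, (row + 1) * tile_h)
                tile = original.crop(box)
                candidates.append(tile.resize((w, h), Image.BILINEAR))
    return candidates

# Usage: levels 1..3 give 1 + 4 + 9 = 14 candidate images.
#   cands = candidate_images(Image.open("frame.png"), max_level=3)
```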

It will also be appreciated that different tiling schemes could be used. For example, some or all of the tiles used for any given level may be overlapping instead of non-overlapping, and/or the tiles need not necessarily be of the same size.

In some embodiments, one or more geometric transformations (e.g. shear, rotation, stretch, scaling, etc.) may be applied to the original image 500 before generating the tiles used for a given level of the hierarchy when generating some or all of the candidate images C_(k) (k=1, 2, . . . , N).

Preferably, the resultant candidate images C_(k) are of the same size H×W.

In summary, each candidate image C_(k) (k=1, 2, . . . , N) is an image corresponding to an area of the original image 500 that, if the candidate image C_(k) is not the whole of the original image 500, has undergone one or more geometric transformations. Put another way, each candidate image C_(k) (k=1, 2, . . . , N) is a version of at least a part (or an area) of the original image 500. The set of candidate images C_(k) (k=1, 2, . . . , N) forms a group of test images corresponding to the original image 500.

Use of the candidate image generator 206 and the step 404 helps to address various problems, including: the initial training sets S₁ and S₂ may have used images at a resolution that is different from the resolution of the images F_(j) of the video sequence 202; the objects may be depicted in the samples of the initial training sets S₁ and S₂ at angles/orientations/positions different from how the objects are depicted in the images F_(j) of the video sequence 202. Thus, use of the candidate image generator 206 and the step 404 helps to mitigate these differences between training and actual use, to thereby help improve the overall accuracy (from both a false positive and a false negative perspective) of the system 200.

4—Example Use Cases

The system 200 and the method 400 may be used in various different ways and for various different purposes. FIG. 6 schematically illustrates an example deployment scenario 600 for the system 200 according to some embodiments of the invention.

The system 200 may form part of a larger system 602. The system 602 may comprise a database 604 (or repository or storage) for storing video sequences 202 to be analysed by the system 200. Thus, the input 204 of the system 200 may obtain the current image F_(j) from a video sequence 202 stored in the database 604.

Additionally or alternatively, the input 204 of the system 200 may be arranged to receive or obtain images F_(j) of the video sequence 202 via a network 610 from a source 606 of the video sequence 202. Whilst FIG. 6 illustrates a single source 606, it will be appreciated that the system 200 may be arranged to receive or obtain images F_(j) of video sequences 202 via one or more networks 610 from multiple sources 606. The network 610 may be any kind of data communication network suitable for communicating or transferring data between the source 606 and the system 200. Thus, the network 610 may comprise one or more of: a local area network, a wide area network, a metropolitan area network, the Internet, a wireless communication network, a wired or cable communication network, a satellite communications network, a telephone network, etc. The source 606 and the system 200 may be arranged to communicate with each other via the network 610 using any suitable data communication protocol. For example, when the network 610 is the Internet, the data communication protocol may be HTTP. The source 606 may be any system or entity providing or supplying the video sequence 202. For example, the source 606 may comprise a television broadcaster, a digital television head-end, a cable or satellite television head-end, a web-based video-on-demand provider, a peer-to-peer network for sharing video sequences, etc. Thus, the input 204 of the system 200 may obtain the current image F_(j) from a video sequence 202 available from (or provided by) the source 606 via the network 610.

In some embodiments, the system 602 is arranged to obtain video sequences 202 via the network 610 from one or more sources 606 and store those video sequences 202 in the database 604 for subsequent analysis by the system 200.

In some embodiments, the entity interested in the results of the object detection and identification carried out by the system 200 is the operator of the larger system 602 and/or the source 606. However, additionally or alternatively, in other embodiments, one or more different entities 608 may be interested in the results of the object detection and identification carried out by the system 200, in which case the results of the method 400 carried out by the system 200 may be communicated to the one or more entities 608 (e.g. via the network 610).

In some embodiments, as will be apparent from the discussion below, the source 606 of the video sequence 202 may be the same as the system 602.

FIG. 7 is a flowchart illustrating an example method 700 according to some embodiments of the invention.

At a step 702, images of a video sequence 202 are obtained from a source 606.

At a step 704, the system 200 is used to (try to) detect and identify an object depicted in the images of the video sequence 202.

If the system 200 does not detect and identify an object within the images of the video sequence 202, then at a step 706, processing is returned to the step 702, at which either further images of the video sequence 202 are obtained or images of a different video sequence 202 may be obtained. Alternatively, processing for the method 700 may be terminated.

If the system 200 does detect and identify an object within the images of the video sequence 202, then at a step 708, one or more entities may be informed that the identified object has been detected in a video sequence obtained from the source 606. Additionally or alternatively, one or more different measures may be taken. Processing is then returned to the step 702, at which either further images of the video sequence 202 are obtained or images of a different video sequence 202 may be obtained. Alternatively, processing for the method 700 may be terminated.
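
The control flow of the method 700 might be sketched as below. The detection network NN₁ and the ensemble of identification networks NN_(2,k) are represented by plain callables, and the majority vote used to combine the ensemble's outputs is one plausible combination rule chosen purely for illustration; none of these names or choices is mandated by the method itself.

```python
# Illustrative sketch of the method-700 loop; detect, identifiers and report
# are hypothetical stand-ins for NN1, the ensemble NN2,k and the reporting step.
from collections import Counter

def analyse_sequence(frames, detect, identifiers, report):
    for frame in frames:
        # Step 704: the first network decides whether an object of the
        # predetermined type is depicted at all.
        if not detect(frame):
            continue  # step 706: fetch further images and try again
        # Only after a positive detection is the ensemble of second networks run.
        votes = Counter(identify(frame) for identify in identifiers)
        label, _ = votes.most_common(1)[0]  # majority vote (an assumption)
        report(label, frame)  # step 708: inform the interested entities
```

Running the ensemble only after a positive detection keeps the per-image cost low on the (typically many) images that depict no object of the predetermined type.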

In one example use-case scenario 600, the predetermined object type is a logo. The logo may be, for example, a logo of a television broadcaster, which is often depicted in (or overlaid onto) broadcast television images (usually in one of the corners of the images). Alternatively, the logo may be a logo indicating an origin of, or an owner of rights (e.g. copyright) in, the video sequence 202, or an indication of a channel for the television images. Thus, the system 200 may be used to test a video sequence 202 to see whether the source 606 of the video sequence 202 is authorized to provide that video sequence 202. Put another way, it may be known that the source 606 is not authorized to use video sequences 202 from a particular broadcaster or content provider, and the broadcaster or content/rights owner/provider may wish to check whether that source 606 is providing their video sequences 202. Thus, the system 200 may be used to detect and identify a logo depicted in a video sequence 202 obtained from a source 606. The set of objects {O₁, O₂, . . . , O_(T)} may therefore be a set of specific logos of interest (e.g. a set of logos of broadcasters or content/rights owners who wish to detect unauthorized use of their content). The video sequence 202 may, for example, be obtained at the system 200 in real-time (such as during a live broadcast or distribution by the source 606), or not in real-time, e.g. as a download from the source 606 (in which case the video sequence 202 may be stored in the database 604 for subsequent analysis by the system 200). If the system 200 identifies a logo in an image of the video sequence 202, then the system 200 may report to one or more entities (e.g. a television broadcaster associated with the identified logo; the owner of the copyright in the video as indicated by the identified logo; the police; etc.) that the video sequence 202 depicting that logo has been obtained from the source 606. Those one or more entities may then take appropriate measures, such as measures to prevent further broadcast/distribution of the video sequence 202, further investigation to gather more evidence regarding the unauthorized use/provision of video sequences, etc.

In one example use-case scenario 600, the predetermined object type is an advertiser logo (or brand or trade mark), for example, a logo displayed on sports clothing worn by athletes, on hoardings, on vehicles, etc. Advertisers may wish to know how often their advertising is shown in a video sequence (e.g. so that they can modify their advertising schemes or provide sponsorship payments accordingly). Thus, the system 200 may be used to detect and identify a logo depicted in a video sequence 202 obtained from a source 606. The set of objects {O₁, O₂, . . . , O_(T)} may therefore be a set of specific logos of interest (e.g. a set of logos of advertisers who wish to detect display of their advertising in video content). The video sequence 202 may, for example, be obtained at the system 200 in real-time (such as during a live broadcast or distribution by the source 606), or not in real-time, e.g. as a download from the source 606 (in which case the video sequence 202 may be stored in the database 604 for subsequent analysis by the system 200). If the system 200 identifies a logo in an image of the video sequence 202, then the system 200 may report to one or more entities (e.g. an advertiser associated with the logo) that the video sequence 202 depicting that logo has been obtained from the source 606. Those one or more entities may then take appropriate measures as set out above.

In one example use-case scenario 600, the predetermined object type is a human face. For example, for a video sequence 202 of a sports match (such as rugby or football), it may be desirable to identify which particular player(s) are being shown (e.g. to assist commentary, match statistics generation, metadata generation, etc.). Likewise, the video sequence 202 may be the whole or part of a movie, and it may be desirable to identify which particular actor(s) are being shown (e.g. to assist in metadata generation, such as which actors are present in which scenes, how often an actor is on-screen, etc.). Likewise, the video sequence 202 may be footage from video cameras (such as CCTV cameras), and it may be desirable to detect whether or not particular people (e.g. wanted criminals, lost people, etc.) are being shown (e.g. to assist the police/authorities with their activities in finding particular people). Thus, the system 200 may be used to detect and identify a face depicted in a video sequence 202 obtained from a source 606. Indeed, the video sequence 202 may be provided by the same system 602 that operates the system 200—for example, a broadcaster may be generating live video of a sports event and may also be using the system 200 to identify competitors participating in that sports event. The set of objects {O₁, O₂, . . . , O_(T)} may therefore be a set of specific faces of interest (e.g. a set of faces of people of interest, such as known rugby players when the system is being used for analysis of rugby matches). The video sequence 202 may, for example, be obtained at the system 200 in real-time (such as during a live broadcast or distribution by the source 606), or not in real-time, e.g. as a download from the source 606 (in which case the video sequence 202 may be stored in the database 604 for subsequent analysis by the system 200). If the system 200 identifies a face in an image of the video sequence 202, then the system 200 may report this to one or more entities, who may then take appropriate measures as set out above. In alternative embodiments, instead of training on faces, the system 200 could be trained on a larger part of a person (e.g. the whole of a person), so that the predetermined type is then “person”. The detection of people or faces may be used for any category of people, such as: actors and actresses, sports players, sports personalities, TV presenters, TV personalities, politicians, etc.

In one example use-case scenario 600, the predetermined object type is a type of animal (e.g. a mouse). For example, for a video sequence 202 of wildlife footage, it may be desirable to identify which animal(s) are being shown—for example, remote cameras may be used to try to capture footage of a rare animal, and the system 200 may be used to try to identify when images of such a rare animal have been captured. Thus, the system 200 may be used to detect and identify animals depicted in a video sequence 202 obtained from a source 606. The set of objects {O₁, O₂, . . . , O_(T)} may therefore be a set of specific animals of interest. The video sequence 202 may, for example, be obtained at the system 200 in real-time (such as during a live broadcast or distribution by the source 606), or not in real-time, e.g. as a download from the source 606 (in which case the video sequence 202 may be stored in the database 604 for subsequent analysis by the system 200). If the system 200 identifies an animal in an image of the video sequence 202, then the system 200 may, for example, generate corresponding metadata associated with the image.

It will be appreciated that the system 200 may be used to detect the presence of, and to identify, other types of object or events within the current image F_(j), such as the depiction of a vehicle, a vehicle licence/number plate, a person, a fire, a game character in video footage of a computer game, a score board in video footage of a sporting event, buildings (to thereby enable detection of locations associated with the video sequence 202—e.g. detecting the Eiffel Tower in the current image F_(j) indicates that the current image F_(j) is associated with Paris), etc.

In some embodiments, the input 204 of the system 200 may be arranged to receive one or more single images instead of images from a video sequence 202. As mentioned, the system 200 may process images of a video sequence 202 on an image-by-image basis (i.e. independently of each other). Therefore, it will be appreciated that the system 200 could be used in situations in which the input data is just a single image. It will, therefore, be appreciated that the discussion above in relation to use of a video sequence 202 applies analogously to single images. For example, image content may be obtained from a webpage in order to analyse whether an advertiser's logo or trade mark is being used within that image content.

In some embodiments, the system 200 may be arranged to operate on other amounts of content. For example, the system 200 may be arranged to receive and process audio data as an amount of content instead of image or video data. In particular, the input 204 of the system 200 may be arranged to receive one or more amounts of audio data (e.g. one-second audio snippets from an audio stream) instead of one or more single images or a video sequence 202. Thus, it will be appreciated that, in the description above, references to a “video sequence” 202 may be replaced by references to an “audio sequence” 202, references to an “image” may be replaced by references to an “audio sample/snippet”, etc. Thus, the neural networks NN₁ and NN_(2,k) (k=1, 2, . . . , M) may have been trained on audio data samples, with a view to the first neural network NN₁ detecting the presence of an audio pattern or characteristic of a predetermined type within an amount of audio data, and with a view to each of the second neural networks NN_(2,k) (k=1, 2, . . . , M) identifying which audio pattern or characteristic of the predetermined type is present in the amount of audio data.
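
For the audio case, a minimal sketch of the adaptation is given below: the audio stream is split into one-second snippets, and each snippet is passed through the same detect-then-identify pipeline. The snippet length, the NumPy representation, the function names and the majority-vote combination are all illustrative assumptions rather than features mandated by the embodiments.

```python
# Illustrative sketch of operating on audio instead of images; all names are
# hypothetical, and a 1-D NumPy array of audio samples is assumed as input.
import numpy as np

def one_second_snippets(samples, sample_rate):
    """Split a 1-D array of audio samples into consecutive one-second snippets."""
    samples = np.asarray(samples)
    for start in range(0, len(samples) - sample_rate + 1, sample_rate):
        yield samples[start:start + sample_rate]

def analyse_audio(samples, sample_rate, detect, identifiers):
    for snippet in one_second_snippets(samples, sample_rate):
        if detect(snippet):  # NN1: is a pattern of the predetermined type present?
            labels = [identify(snippet) for identify in identifiers]
            # Combine the ensemble's answers, e.g. by majority vote (assumption).
            yield max(set(labels), key=labels.count)
```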

In one example use-case scenario 600, the predetermined object type is a voice or other type of noise/sound. For example, for an audio sequence 202, it may be desirable to identify which people are speaking—for example, in a radio broadcast, it may be desirable to identify which broadcasters are speaking, which music artists are being played, etc. As another example, for an audio sequence 202, it may be desirable to identify the sound of alarms, radio jingles (which could, for example, identify the specific source or rights holder of the audio sequence 202), or other events. Thus, the system 200 may be used to detect and identify voices or other noises present in an audio sequence 202 obtained from a source 606. The set of objects {O₁, O₂, . . . , O_(T)} may therefore be a set of specific voices or noises of interest. The audio sequence 202 may, for example, be obtained at the system 200 in real-time (such as during a live broadcast or distribution by the source 606), or not in real-time, e.g. as a download from the source 606 (in which case the audio sequence 202 may be stored in the database 604 for subsequent analysis by the system 200). If the system 200 identifies a voice or noise in a sample of the audio sequence 202, then the system 200 may, for example, generate corresponding metadata associated with the sample.

In one example use-case scenario 600, the predetermined object type is a word or phrase. For example, for an audio sequence 202, it may be desirable to identify which particular words or phrases occur in the audio sequence 202 (e.g. to enable automatic subtitling, searching through content based on keywords, identifying when an event has occurred, such as when a sports commentator shouts “Goal!”, etc.). Thus, the system 200 may be used to detect and identify words present in an audio sequence 202 obtained from a source 606. The set of objects {O₁, O₂, . . . , O_(T)} may therefore be a set of specific words of interest. The audio sequence 202 may, for example, be obtained at the system 200 in real-time (such as during a live broadcast or distribution by the source 606), or not in real-time, e.g. as a download from the source 606 (in which case the audio sequence 202 may be stored in the database 604 for subsequent analysis by the system 200). If the system 200 identifies a word in a sample of the audio sequence 202, then the system 200 may, for example, generate corresponding metadata associated with the sample, generate subtitles, provide search results, etc.

The content (video sequence 202, images and/or audio data) processed by the system 200 may originate from a variety of sources 606, including live or recorded content, computer generated content (such as game content), augmented reality, virtual reality, etc.

Thus, the system 200 may be used in a variety of ways, including:

- generating metadata from content
- detecting the broadcaster or channel associated with content
- detecting people, game characters, etc. in content
- detecting location in, or associated with, content
- detecting logos, brands and/or advertising in content, which may, for example, be used to facilitate measuring brand impact in content or measuring advertising impact in content
- detecting fraudulent advertising
- detecting pirated content
- finding fraudulent goods and services by identifying such goods/services in images or video, i.e. brand protection
- video annotation based on the detection and identification of audio or visual events
- identifying TV, movie and sports content
- annotation of TV, movie and sports content
- searching through content using keywords, phrases, etc. (e.g. searching through footage of a sports match to identify when a commentator said “Goal” or mentioned a particular player by name)
- searching through video based on the appearance of a person or character
- identifying a movie or video based upon the appearance of characters on screen.

5—Modifications

It will be appreciated that the methods described have been shown as individual steps carried out in a specific order. However, the skilled person will appreciate that these steps may be combined or carried out in a different order whilst still achieving the desired result.

It will be appreciated that embodiments of the invention may be implemented using a variety of different information processing systems. In particular, although the figures and the discussion thereof provide an exemplary computing system and methods, these are presented merely to provide a useful reference in discussing various aspects of the invention. Embodiments of the invention may be carried out on any suitable data processing device, such as a personal computer, laptop, personal digital assistant, mobile telephone, set top box, television, server computer, etc. Of course, the description of the systems and methods has been simplified for purposes of discussion, and they are just one of many different types of system and method that may be used for embodiments of the invention. It will be appreciated that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or elements, or may impose an alternate decomposition of functionality upon various logic blocks or elements.

It will be appreciated that the above-mentioned functionality may be implemented as one or more corresponding modules as hardware and/or software. For example, the above-mentioned functionality may be implemented as one or more software components for execution by a processor of the system. Alternatively, the above-mentioned functionality may be implemented as hardware, such as on one or more field-programmable gate arrays (FPGAs), and/or one or more application-specific integrated circuits (ASICs), and/or one or more digital signal processors (DSPs), and/or one or more graphics processing units (GPUs), and/or other hardware arrangements. Method steps implemented in flowcharts contained herein, or as described above, may each be implemented by corresponding respective modules; multiple method steps implemented in flowcharts contained herein, or as described above, may be implemented together by a single module.

It will be appreciated that, insofar as embodiments of the invention are implemented by a computer program, then one or more storage media and/or one or more transmission media storing or carrying the computer program form aspects of the invention. The computer program may have one or more program instructions, or program code, which, when executed by one or more processors (or one or more computers), carries out an embodiment of the invention. The term “program”, as used herein, may be a sequence of instructions designed for execution on a computer system, and may include a subroutine, a function, a procedure, a module, an object method, an object implementation, an executable application, an applet, a servlet, source code, object code, byte code, a shared library, a dynamic linked library, and/or other sequences of instructions designed for execution on a computer system. The storage medium may be a magnetic disc (such as a hard drive or a floppy disc), an optical disc (such as a CD-ROM, a DVD-ROM or a BluRay disc), or a memory (such as a ROM, a RAM, an EEPROM, an EPROM, Flash memory or a portable/removable memory device), etc. The transmission medium may be a communications signal, a data broadcast, a communications link between two or more computers, etc.

CLAIMS

1. A method for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the method comprises, for each of one or more images of the sequence of images: using a first neural network to determine whether or not an object of a predetermined type is depicted within the image; and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks to identify the object determined as being depicted within the image.
2. The method of claim 1, wherein the first neural network and/or one or more of the second neural networks is a convolutional neural network or a deep convolutional neural network.
 3. (canceled)
4. The method of claim 1, wherein using a first neural network to determine whether or not an object of a predetermined type is depicted within the image comprises: generating a plurality of candidate images from the image; using the first neural network to determine, for each of the candidate images, an indication of whether or not an object of the predetermined type is depicted in said candidate image; and using the indications to determine whether or not an object of the predetermined type is depicted within the image.
5. The method of claim 4, wherein one or more of the candidate images is generated from the image by performing one or more geometric transformations on an area of the image.
6. The method of claim 1, wherein the predetermined type is a logo.
7. The method of claim 1, wherein the predetermined type is a face or a person.
8. The method of claim 1, comprising associating metadata with the image based on the identified object.
9. The method of claim 6, comprising: obtaining the video sequence from a source; and determining unauthorized use of the video sequence based on identifying that the logo is depicted within one or more images of the video sequence.
 10. The method of claim 9, wherein the logo is one of a plurality of predetermined logos.
11. A method for identifying an object within an amount of content, the method comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.
12. The method of claim 11, wherein the amount of content is one of: (a) an image; (b) an image of a video sequence that comprises a sequence of images; and (c) an audio snippet.
 13. The method of claim 11, wherein the first neural network and/or one or more of the second neural networks is a convolutional neural network or a deep convolutional neural network.
 14. (canceled)
15. The method of claim 11, wherein using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content comprises: generating a plurality of content candidates from the amount of content; using the first neural network to determine, for each of the content candidates, an indication of whether or not an object of the predetermined type is depicted in said content candidate; and using the indications to determine whether or not an object of the predetermined type is depicted within the amount of content.
16. The method of claim 15, wherein one or more of the content candidates is generated from the amount of content by performing one or more geometric transformations on a portion of the amount of content.
17. The method of claim 11, wherein the amount of content is an audio snippet and the predetermined type is one of: a voice; a word; a phrase.
18. The method of claim 11, comprising associating metadata with the amount of content based on the identified object.
19. An apparatus comprising one or more processors, the one or more processors being arranged to carry out identification of an object within an amount of content, said identification comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.
 20. (canceled)
21. A non-transitory computer-readable medium storing a computer program which, when executed by one or more processors, causes the one or more processors to carry out identification of an object within an amount of content, said identification comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.
22. The apparatus of claim 19, wherein the amount of content is one of: (a) an image; (b) an image of a video sequence that comprises a sequence of images; and (c) an audio snippet.
23. The non-transitory computer-readable medium of claim 21, wherein the amount of content is one of: (a) an image; (b) an image of a video sequence that comprises a sequence of images; and (c) an audio snippet.